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Abstract 

Background: Reverse engineering gene networks and identifying regulatory interactions are integral to 
understanding cellular decision making processes. Advancement in high throughput experimental techniques has 
initiated innovative data driven analysis of gene regulatory networks. However, inherent noise associated with 
biological systems requires numerous experimental replicates for reliable conclusions. Furthermore, evidence of 
robust algorithms directly exploiting basic biological traits are few. Such algorithms are expected to be efficient in 
their performance and robust in their prediction. 

Results: We have developed a network identification algorithm to accurately infer both the topology and strength 
of regulatory interactions from time series gene expression data in the presence of significant experimental noise 
and non-linear behavior. In this novel formulism, we have addressed data variability in biological systems by 
integrating network identification with the bootstrap resampling technique, hence predicting robust interactions 
from limited experimental replicates subjected to noise. Furthermore, we have incorporated non-linearity in gene 
dynamics using the S-system formulation. The basic network identification formulation exploits the trait of sparsity 
of biological interactions. Towards that, the identification algorithm is formulated as an integer-programming 
problem by introducing binary variables for each network component. The objective function is targeted to 
minimize the network connections subjected to the constraint of maximal agreement between the experimental 
and predicted gene dynamics. The developed algorithm is validated using both in silico and experimental data-sets. 
These studies show that the algorithm can accurately predict the topology and connection strength of the in silico 
networks, as quantified by high precision and recall, and small discrepancy between the actual and predicted 
kinetic parameters. Furthermore, in both the in silico and experimental case studies, the predicted gene expression 
profiles are in very close agreement with the dynamics of the input data. 

Conclusions: Our integer programming algorithm effectively utilizes bootstrapping to identify robust gene 
regulatory networks from noisy, non-linear time-series gene expression data. With significant noise and non- 
linearities being inherent to biological systems, the present formulism, with the incorporation of network sparsity, is 
extremely relevant to gene regulatory networks, and while the formulation has been validated against in silico and 
E. Coli data, it can be applied to any biological system. 
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Background 

The progress in the field of experimental techniques in 
systems biology in recent years has contributed signifi- 
cantly to the analysis and understanding of gene regula- 
tory networks [1]. The simultaneous measurement of the 
expression levels of thousands of genes has become pos- 
sible with these techniques. The time series data of gene 
expression obtained from the high-throughput techni- 
ques typically contain comprehensive information about 
the structure of the system. However, reverse engineering 
that data for identification of interactions between genes 
and reconstruction of the regulatory network is still a 
challenging problem. 

A variety of modeling approaches have been developed 
recently for inferring genetic networks from gene expres- 
sion data. Identification algorithms are dependent on 
how the network is modeled [2], and include Boolean 
logic [3,4], Bayesian [5-7], and information-theoretic 
approaches [8]. Several approaches use steady state infor- 
mation, the data of which typically coming from "struc- 
tural perturbations" (such as gene knockout studies) [9], 
which might be difficult to obtain for some systems. Alter- 
nate approaches using time series data include dynamic 
Bayesian networks [10,11] and differential equation-based 
models [12,13]. Of the latter, initial reports on reverse engin- 
eering gene networks assumed linear model approximations 
[13,14]. While such approximations retain simplicity in the 
identification algorithm, it may be inadequate in predicting 
strongly non-linear systems. One way of representing 
non-linear gene dynamics is the S -system model, a 
power-law formulation which incorporates both pro- 
duction and degradation terms of the genes. Previous 
studies have looked into network identification of non- 
linear systems with the S -system [1,12,15-20], which pre- 
sents a more challenging task than identification of linear 
systems. In addition to non-linearities, gene regulatory 
networks are highly noisy and stochastic [21] which can 
lead to difficulties during network inference. Therefore, a 
strong need exists for robust network identification of 
non-linear systems in the presence of high system variabil- 
ity, while also being able to incorporate relevant biological 
information. 

In the current report, we model the dynamics of gene 
expression by S -system formulation. Upon doing so, we 
formulate the network identification algorithm as a 
bi-level optimization problem, governed by the hypoth- 
esis of network sparsity. Network sparsity has been ex- 
perimentally observed in various biological systems such 
as the visual system of primates [22], auditory system of 
rats [23], and olfactory system of insects, to name a few. 
The sparsest gene network has also been eluded to be a 
robust one [24]. Governed by the hypothesis of sparsity 
of network connections, the target of our network identi- 
fication algorithm is to find the network structure with 



minimum number of connections that is in agreement 
with the experimental data at an acceptable level of tol- 
erance. We have earlier proposed an optimization for- 
mulation to identify the regulatory network from time 
profiles of gene expression data [25]. The previous algo- 
rithm was based on the following approximations: (i) 
gene expression dynamics were approximated by linear 
ordinary differential equations (ode); and (ii) the system 
was treated as deterministic by considering only the 
mean experimental data for the analysis. In the present 
algorithm, we developed a novel formulism which uti- 
lizes bootstrapping to identify robust networks from 
noisy data. The aforementioned approximations are 
removed by (i) representing the gene expression profile 
with an S -system model and (ii) directly accounting for 
variability in experimental data. Our algorithm, as detailed 
in the methods section and represented in Figure 1, 
enables identification of robust networks from an inher- 
ently non-linear and noisy system. We test the perform- 
ance of our algorithm in various case studies including in 
silico and experimental data sets. 

Results 

The performance of the developed bi-level integer pro- 
gramming algorithm is demonstrated on three case stud- 
ies. In the first case study, we consider in silico gene 
expression data generated from a benchmark artificial 
5-gene network model. In the second case study, the ap- 
plicability of the algorithm on a larger network is tested 
using an in silico 10-gene network. In the third case 
study, the algorithm is applied to an experimental data 
set of the SOS DNA repair system in ExolL 

I Case Study 1: Five gene network model 

The purpose of this case study is to validate the algo- 
rithm on a small network with and without experimental 
noise. The chosen 5-gene network model [16] has been 
used as a benchmark problem by different research 
groups to test the validity of their algorithms [15,19]. 

lA Network identification without noise 

Using the S -system formulation, the 5-gene network 
model can be represented by the system of five coupled 
nonlinear ode, shown in Additional file 1 as equation 1 
[16]. In order to test our identification algorithm on this 
model we first generate in silico data by integrating these 
equations, which we use as experimental data for the 
identification algorithm. To formulate the bi-level 
optimization problem, n^ = 25 binary variables are intro- 
duced corresponding to each of the five connections. 
Genetic Algorithm (GA), used to solve the upper level 
integer programming problem, does not have a conver- 
gence criterion. Standard practice is to evolve the popu- 
lation for enough generations until no significant 
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"> doi=1:# 
bootstrap runs 



Bootstrap selection of experimental data set 



•"> doj=1: # of genetic 
algorithm generations 



-Generate k binary strings which represent network connections 
-For each GA generation, these strings are generated based on 
-previous generation's results 

-updates based on evolutionary rules (cross-over, mutations, etc.) 



do m=1: k#of binary 
strings (population) 



Perform least squared algorithm on the S-system 
-network connections are based on the m^'^ binary 
string in the population 

-estimating values of reaction orders (only those 
present in binary string) and rate constants 
-error based on bootstrap experimental data set 



"> end do 



end do 



Based on sparsity and least squared error, determine 
best fit network/parameters of j^^ generation 



Determine best fit of i^^ bootstrap run 



end do 



Compile bootstrap statistics on best fit connections and values 



Figure 1 Pseudo-code of the robust network identification algorithm implementation. 



improvement is observed. Figure 2(a) illustrates the con- 
vergence characteristics of the GA for this example; at 
over 103 generations, the optimal output remained in- 
variant. The efficiency of the algorithm depends on ap- 
propriate choice of starting population, as well as other 
involved parameters, in addition to the number of 



generations. The initial population size plays an import- 
ant role in the quality and efficiency of the algorithm. A 
small population size may lead to local convergence or 
extremely large number of generations. To avoid that a 
population size of 20 was chosen and the algorithm 
evolved for 150 generations. The crossover probability is 



a 
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Figure 2 Identification of a 5-gene network without noise, (a) Convergence study of the genetic algorithm. The number of connections 
identified in each of the solutions generated by GA is plotted. No feasible solution was found with less than 65 generations, (b) Identified 
network. Arrows represent positive regulation and the filled circles represent negative regulation of the genes. Kinetic orders of each connection 
are represented above the corresponding connecting lines and the rate constants for each gene are shown above the genes. All connections and 
parameters are consistent with the original differential equations used to generate the in silico data. 
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chosen to be at a standard value of 0.5, and the chosen 
mutation probability of 0.02 was expected to maintain 
diversity in population. Since the data contain no noise, 
the tolerance in the lower level least square optimization 
problem has been kept at a very low value (10'^). Typic- 
ally least square optimization routines are very sensitive 
to the user defined initial guess. To make sure that the 
algorithm can identify the underlying network structure 
even without any a priori information, we deliberately 
assigned the initial guess values for the least square 
optimization problem to be largely different from the ac- 
tual values, and tested the algorithm for various combi- 
nations of the initial guess. 

Figure 2(b) illustrates the 5-gene network identified 
using the above formulation. The kinetic orders (gij) are 
depicted over the connection and the kinetic rate con- 
stants (ttij, pij) are depicted in brackets. The precision 
and recall value were both a perfect 1.0, indicating the 
accuracy with which the proposed algorithm predicted 
the network structure from time profile gene expression 
data. In addition, the identified kinetic orders and rate 
constants are also in agreement with the actual network 
model presented in Additional file 1 equation (1). These 
results validate the performance of the algorithm for a 
small network under deterministic conditions. 

IB Network identification under data uncertainty 

The performance of the algorithm is next analyzed in the 
presence of experimental noise, generated by adding 5% 
Gaussian noise to the time-course data generated from 
equation (1) shown in Additional file 1. Three different 
data sets are generated in this fashion to represent three 
experimental replicates of the samples. These three data 
sets are then resampled using bootstrapping to generate 
1000 artificial data sets. The network identification algo- 
rithm was then applied at each of the data sets to gener- 
ate 1000 alternate networks. The presence of noise in 
the data restricts the accuracy by which the predicted 
profile can agree with the data. Hence, the tolerance was 
relaxed to 0.12 and the GA code was evolved for 200 
generations while retaining the population size of 20. The 
ensemble of alternate networks thus generated was ana- 
lyzed for frequency of appearance of each of the connec- 
tions (Figure 3(a)) which was hypothesized to directly 
correspond to its robustness against experimental noise. 

Figure 3(b) further illustrates the identified robust net- 
work connections screened for 45% occurrence, with fre- 
quency of occurrence of network connections being 
depicted over the connection. Quite encouragingly, the 
algorithm correctly identified all the existing connections 
in the actual network. However because of noise, the al- 
gorithm also identifies two false interactions involving 
gene 2, hence resulting in a recall and precision of 1 and 
0.78, respectively. 



The expected values of the S -system parameters esti- 
mated at 90% confidence level are represented in Table 1 
(^ij) and Table 2 {a-^y pij), which demonstrates the excel- 
lent performance of the algorithm in identifying network 
parameters even from noisy data. While the error of the 
rate constants (compared to the actual values) is rela- 
tively high, it should be noted that the results of the net- 
work identification would not be as sensitive to these 
parameters as to the connectivity values, and therefore 
the rate constant values could vary significantly and not 
affect the gene profiles or the recall/precision. Further- 
more, the error on the reaction orders (gij) is very low, 
further demonstrating the accuracy of the network iden- 
tification. The heat map in Figure 3(c) further shows the 
algorithm s effectiveness in finding a tight range of reac- 
tion orders of the robust connections in the network. 

To evaluate the accuracy of the formulism under 
increased uncertainty, the algorithm was tested under 
various amounts of added noise. As one would ex- 
pect, the accuracy of the algorithm depends on the 
level of noise added to the in silico data. Table 3 
shows this trend, with the precision and recall being 
compared with 5, 7, and 10% noise. Increasing noise 
increases the number of false negatives, thereby re- 
ducing the recall. Interestingly, precision actually 
improves with increasing noise, indicating less false 
positives. This trend seems to converge, with both 
the recall and precision holding constant at 7 and 
10%. Figure 4(a) shows the identified network with a 
data set incorporating 10% white Gaussian noise. 
The algorithm does not identify any connection 
which is not in the actual network (e.g. 0 false posi- 
tives) and is therefore able to achieve a perfect pre- 
cision. However because of noise, the algorithm also 
fails to identify three of the actual connections (false 
negatives), hence resulting in a recall of 0.57. It 
should be noted that the frequency threshold affects 
the results, and depending on the system, needs to 
be tuned. Figure 4(b) shows the sensitivity of the re- 
call and precision to this threshold value. 

While the analysis is performed on 1000 bootstrap 
samples, it is computationally expensive to solve 1000 
network identification problems. Hence, we investigated 
the sensitivity of the identified robust network on the 
number of bootstrap samples by considering a broad 
range of samples from 200 to 1000. Figure 5 illustrates 
the percentage of total number of appearances of each 
identified interaction in every 200 bootstrapped samples, 
using 5% noise. The difference in the maximum and 
minimum number of appearances is less than 8% for all 
connections. The clearly shows that as little as 200 boot- 
strap samples can be enough in drawing statistically sig- 
nificant conclusions, which is in agreement with the 
literature [26]. 
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Figure 3 Results from 5-gene network identified under data uncertainty with 5% noise, (a) Number of bootstrap occurrences for each 
connection (1000 bootstrap samples total), (b) Identified network structure. Numbers above each connection represent percent occurrence, with 
the thick lines representing the number of connections appearing in more than 90% of the bootstrapped samples and the thin lines representing 
the connections appearing in more than 45% of bootstrapped samples, (c) frequency of specific connection values shown as heat map. 



7C Deterministic network identification under data 
uncertainty 

To assess the necessity of this bootstrapping technique, 
the aforementioned results were compared to a control 
group which did not utilize bootstrapping. To do this, a 
more deterministic approach was employed. Experimental 
replicates were generated as detailed: 10% white Gaussian 
noise was added to the 5-gene in silico network. Instead of 
bootstrapping these replicates, the deterministic network 
identification was performed on the mean of the replicates. 
This was done for 3, 5, 7, 9, and 20 replicates with the 



resulting precision and recall calculated for each case; 
results are shown in Figure 6. As shown, when the input 
data is generated from fewer than seven replicates, a solu- 
tion is not found. Even with seven replicates, the results 
are relatively poor. While the recall is comparable to that 
generated from bootstrapping (-0.57), precision is much 
worse (0.5). As the number of replicates is increased, this 
precision increases; however, even at 20 replicates, preci- 
sion is not perfect (0.8). Furthermore, in practice, generat- 
ing this many experimental replicates is often not feasible. 
This illustrates that the proposed bootstrapping technique 
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Table 1 Comparison of the identified S-system reaction 
order values to actual values 



Connection 


9actual 


9estimated 


G1G3 


1 


1.2 ±0.09 


G1G5 


-1 


-1.0 ±0.03 


G2G1 


2 


2.4 ±0.04 


G2G3 


NA 


-3.4 ±0.06 


G2G4 


NA 


3.9 ±0.05 


G3G2 


-1 


-1.1 ±0.03 


G4G3 


2 


1.9 ±0.02 


G4G5 


-1 


-1.0 ±0.01 


G5G4 


2 


2.0 ±0.02 



offers an accurate way of determining robust connections 
over a more traditional method, even with limited number 
of experimental repeats. 

II Case Study 2: Ten Gene Network Model 

In this example we investigate the performance of the 
developed algorithm in a larger network consisting of 
ten genes, as depicted in equation (2) shown in 
Additional file 1. For the deterministic case study the tol- 
erance was specified at a low value of 10'^. Because the 
10-gene network increases the number of binary vari- 
ables in the upper level to 100, more GA generations are 
needed to obtain a converged solution; therefore, the 
number of generations was increased to 1000. The iden- 
tified connections and kinetic parameters are shown in 
Figure 7(a), with the kinetic orders (gi^) depicted over the 
connections and kinetic rate constants (aij, pij) in brack- 
ets over the genes. The comparison of actual and identi- 
fied time series profiles is shown in Figure 7(b). As 
evident from the figures, the algorithm correctly identi- 
fied all the connections, kinetic orders and rate constants 
with a precision and recall of 1.0, thus verifying the satis- 
factory performance of the algorithm in larger systems. 

III Case Study 3: Experimental Data of E.Coll SOS DNA 
repair 

The proposed algorithm is next applied to the SOS DNA 
repair system of E.Coll [27], based on the gene data 

Table 2 Comparison of the identified S-system rate 
constant values to actual values 



Gene Oj Pi 





actual 


estimated 


actual 


estimated 


XI 


5 


3.8 ±0.2 


10 


18.0 ±0.8 


X2 


10 


13.8 ±0.9 


10 


16.2 ±0.2 


X3 


10 


13.8 ±0.2 


10 


11. 2 ±0.23 


X4 


8 


8.1 ±0.1 


10 


11.8±0.1 


X5 


10 


10.3 ±0.05 


10 


8.9 ±0.03 



Table 3 Effect of added noise on the network 



identification results 



Percent noise 


Recall 


Precision 


5 


1 


0.78 


7 


0.57 


1 


10 


0.57 


1 



White Gaussian noise was added at different amounts to the five-gene 
network in silico data. Each of these data sets was used in the network 
identification algorithm, with their performance measured by the metrics of 
recall and precision. 



measured by Ronen et al [28] which is available online 
[29]. In this model system, the response to DNA damage 
is governed by a few key genes, which in turn regulate 
the expression of more than 30 genes which have specific 
roles in DNA repair. A proposed model is that the RecA 
protein binds to single stranded DNA, and this nucleo- 
protein is integral in LexA cleavage, a transcription 
factor which is a major regulator of the DNA repair 
genes [27]. The work of Ronen et al investigates the 
Michaelis-Menten kinetic parameters associated with 
promoter activity for eight of the major genes in this sys- 
tem. Experimental kinetics were measured by first in- 
corporating a GFP reporter plasmid for each of the 
genes promoter. DNA damage was induced, and the 
resulting GFP intensities were measured. The number of 
GFP molecules is proportional to the promoter activity, 
and can be taken to be analogous to the rate of tran- 
scription [28]. We therefore used this promoter activity 
data [29] to represent gene expression (with the experi- 
mental intensity data normalized by the mean column 
intensity) and used it in our algorithm. Among the four 
data sets provided by the authors, we chose the third 
and fourth for this case study because these are mea- 
sured at the same conditions. Our objective was to iden- 
tify regulatory interactions between six genes: uvrD, 
lexA, umuD, recA, uvrA and polB, 

Identification of this 6 gene network will require 36 
binary variables; hence the GA parameters were retained 
similar to our first case study presented earlier: 20 popu- 
lations evolved through 200 generations. The error toler- 
ance, however, had to be relaxed to a higher value of .7 
because of noise inherent in the experimental data set. 
Figure 8(a) compares the actual experimental data with 
the predicted profiles generated from the identified algo- 
rithm, which shows excellent agreement. 

In the next step the robust connections of the identi- 
fied network are further analyzed by bootstrapping the 
experimental data set. Since our previous analysis on the 
first case study demonstrated 200 bootstrap samples to 
be adequate, in this example we generated 300 artificial 
data sets from the original experimental repeats. The 
network identification algorithm was solved at each of 
the data sets to generate 300 alternate networks. The 
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frequency of occurrence of each network connection is 
analyzed over the array of alternate network and connec- 
tions appearing with over 45% frequency are considered to 
be robust Figure 8(b) illustrates the predicted robust net- 
work for the E. Coli data set along with the frequency of 
repeat of each connection. The corresponding estimated 
kinetic orders (g^^ and rate constants {a^, pij) with 90% 
confidence level are shown in Table 4 and Table 5, respect- 
ively. The heat map in Figure 8(c) further shows how well 
the algorithm identifies a robust network . 

Discussion 

In this work, we present an algorithm to identify robust 
regulatory networks from time profiles of gene expres- 
sion data. Our identification algorithm is primarily devel- 
oped on the hypothesis of sparsity of biological network 
connections. In our earlier work we established the 



validity of the hypothesis of sparsity using a simplified 
linear ode representation of gene expression dynamics in 
a deterministic system. Herein we further advance the al- 
gorithm by incorporating more realistic non-linear repre- 
sentation using an S -system formulation of gene 
expression dynamics. The identification algorithm is for- 
mulated as a bi-level optimization problem in which the 
upper level solves an integer programming problem 
while the lower level is a continuous parameter identifi- 
cation problem. Furthermore, we propose a framework 
to incorporate noisy experimental data towards identifi- 
cation of a robust regulatory network. This is done by 
first generating artificial experimental repeats using the 
bootstrapping technique, followed by solving the identifi- 
cation formulation at each of the bootstrap data sets. 
From this library of identified prospective networks we 
isolate the most-repeated network connections which we 
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Figure 6 Deterministic approach to network identification 
under noisy data. Increasing number of replicates were generated 
using 10% noise from the in silico results, averaged, and used in the 
network identification algorithm, with their recall and precision 
quantified. Although three and five replicates were also used, these 
are not shown because no solution was found. 



hypothesize to be a robust connection, having low vari- 
ability to experimental noise. 

The upper level integer programming problem is 
solved using GA. There are several advantages of using 
GA to solve the above problem, the most important 
being that it does not require gradient evaluation. This is 
a significant advantage for the above problem with 
non-linear ode as constraint function. In addition, GA 
starts its search not from a single point in the feasible 
parameter space, but from multiple locations specified in 
the starting population. Hence, it holds the chance of 
converging at global minima, although such convergence 
cannot be guaranteed with GA. However, it also suffers 
from the disadvantage of increased computational cost. 
All the computations reported here have been carried 
out on 2.66 Ghz processer and 16 GB RAM server. The 
computational time for the five gene network without 
noise was 1 hour and the same network with noise was 
2.5 hours. The computational time for the experimental 
data was 3 hours. For the 10 gene network, the genetic 
algorithm needed more generations to converge, result- 
ing in computational time of 11 hours. Hence, extension 
of the current solution procedure to a much larger data 
set will be expensive. While the same formulation will 
still be applicable in a larger system, alternate solution 
procedures are currently being investigated for its exten- 
sion to larger networks. 

In the formulation presented in equation (2), the only 
user defined parameter is the value of the tolerance which 
dictates how closely the model prediction must agree with 
experimental dynamics in order for the network to be con- 
sidered in the overall algorithm. While for an in silico case 
study without noise the tolerance may not play a vital role, 
it will be relevant when evaluating noisy scenarios. Specify- 
ing a low tolerance value (10'^) in our algorithm under 
noisy data failed to identify any network, as would be 



expected. Moreover, using a low tolerance is not advisable 
when using data sets with noisy replicates since we are not 
targeting a profile which exactly fits the noisy data; the tar- 
get is to identify network profiles which describe all the 
noisy scenarios relatively well. On the other hand, a 
relaxed tolerance runs the risk of compromised prediction 
quality. In order to quantitatively evaluate the effect of 
specified tolerance on the identified network structure, the 
bootstrap/ bi-level optimization algorithm was repeated 
on the same 5 -gene dataset with different tolerance values. 
Table 6 illustrates how the precision and recall of the iden- 
tified network changes with altered tolerance values. Quite 
interestingly it is observed that precision is relatively in- 
sensitive to the network tolerance, while recall worsens 
with increased tolerance. This is very encouraging since 
this implies that even with relaxed tolerance the identified 
network does not have false positive connections, although 
false negative connections increase. Increase in false nega- 
tives can be explained by the nature of the objective func- 
tion which tries to minimize the number of connections. 
Hence, relaxed tolerance will always lead to a 
sparser network, as seen in Table 6. This analysis 
indicates that even for a relaxed constraint the algo- 
rithm may fail to identify all the connections but 
the identified connections will always be accurate 
with low probability of false positivity. 

The performance of the developed robust identification 
formulation is illustrated using three different systems. 
The first two case studies are based on in silico data 
which allows for detailed analysis of the performance of 
the algorithm. Overall the algorithm was found to dem- 
onstrate excellent predictive capability both in the small 
5-gene network along with larger 10-gene network. The 
proposed bootstrapping scheme was found to adequately 
capture the precise network from the noisy data as well. 
Encouraged by the in silico results, we applied our algo- 
rithm to dynamic experimental data of a 6-gene network 
responsible for DNA damage repair in E. Coli [28]. While 
verification of the identified network will be difficult for 
this system, the time profile of gene expression data pre- 
dicted by the identified network is in good agreement with 
the experimental data set. A thorough literature search for 
existing knowledge of network interactions revealed that 
quite a few of the predicted connections have been 
reported in parallel studies. Our algorithm inferred the 
regulation of recA, umuD and uvrA by lexA, which is con- 
sistent with the findings reported earlier [12]. Another 
interesting finding is that our results suggest that polB 
does not influence any of the other genes in the system 
(pol B does not up- or down-regulate any other gene), a 
finding which was also reported by Kumura et al Further- 
more, our identified network shows the self-regulation of 
recA, This protein is the main factor responsible for sens- 
ing DNA damage, and has been reported to promote the 
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Figure 7 Results from the 10-gene network, (a) Identified network. Arrows represent positive regulation and the filled circles represent 
negative regulation of the genes. The kinetic orders of each connection are represented above the corresponding connecting lines and the rate 
constants for each gene are shown above the genes, (b) Time profile for the ten gene network. The triangles represent the profile generated 
from the in silico data and the lines represent the predicted profile. 



transcription of itself, thereby promoting damage recogni- 
tion, and other repair genes [27,28]. 

The current approach offers an improvement on existing 
algorithms. Numerous studies have used the 5-gene net- 
work (the current case study I) to test the accuracy and ef- 
ficiency of their network identification methods. A 
comparison between the methods is presented by Kimura 
et aL [12] for the five gene network without noise. While 
most studies do not report the metrics of precision and re- 
call, the accuracy of the results is still commented on. 



Most methods have a shorter computational time than the 
proposed method. However, our algorithm is able to pre- 
dict a perfect network (recall and precision of 1), while the 
other algorithms deviate from this. Therefore, there is a 
trade-off between computational time and accuracy, and 
selection of the most appropriate method for the system of 
interest should be chosen judiciously. Nevertheless, this 
comparison shows that recall and precision are an im- 
provement over many existing algorithms when analyzing 
the 5-gene network. Additional improvements could be 
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Figure 8 Results from the 5-gene experimental E.Coli data, (a) Time profile for tine gene network, based off of tine mean experimental data. 
The triangles represent the experimental data and the lines represent the predicted profile, (b) Identified network structure from experimental 
data for six gene system. The percentage of connections in the bootstrapping samples are marked on the connections, (c) frequency of specific 
connection values shown as heat map (connection coding: 1-uvrD, 2-lexA, 3-umuD, 4-recA, 5-uvrA, 6-polB). 



made on the current approach to decrease computation 
time, such as parallel programming, or by altering the for- 
mulism (e.g. avoiding direct integration of the system of ode). 

Conclusions 

These results show that our bi-level integer optimization 
algorithm is able to effectively identify the topology and 
connection strength of gene regulatory networks, even 



when the gene dynamics are non-linear and noisy in na- 
ture. By using the biological trait of sparsity, the algo- 
rithm optimizes the number of connections in the 
network while maintaining agreement in gene temporal 
profiles with the experimental input data. Even with un- 
certainty and noise in the data, something which is 
unavoidable on an experimental level, our bootstrap- 
ping/identification combination was able to identify a 
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Table 4 Estimated reaction order values of the E. Coli SOS 
DNA repair network (connection coding: 1-uvrD, 2-lexA, 
3-umuD, 4-recA, 5-uvrA, 6-polB) 



Connection Qestimated 



G1G2 


0.9 ±0.04 


G1G3 


1 .4 ± 0.03 


G1G4 


0.9 ±0.04 


G2G3 


0.9 ±0.04 


G2G4 


0.9 ±0.07 


G2G5 


0.8 ±0.08 


G3G4 


1.0 ±0.03 


G3G5 


0.9 ±0.03 


G4G4 


1.3 ±0.07 


G4G5 


-0.7 ±0.05 


G5G1 


1.0±0.14 



robust network. While we have demonstrated the effect- 
iveness of our algorithm on in silico and E, coli data, its 
formulation, biological relevancy, and results are applic- 
able to any gene regulatory network, as long as time- 
series data is available. 

Methods 

S-system representation of gene expression dynamics 

Identification of the regulatory network from time series 
gene expression data first requires modeling the dynamic 
evolution of the individual genes constituting the net- 
work. Here we model gene dynamics as a set of coupled 
non-linear ode following the S-system formulation, 
which captures the non-linearity in gene expression pro- 
files using a power-law kinetic representation. 

For a system with N-genes, the S-system model can be 
represented using equation (1): 

Where Xi is the concentration of the gene /, a and 
represent the kinetic rate constants, g and h represent 
the kinetic orders for the production and degradation 

Table 5 Estimated rate constant values of the E. Coli SOS 
DNA repair network (connection coding: 1-uvrD, 2-lexA, 
3-umuD, 4-recA, 5-uvrA, 6-polB) 



Gene Oj Pi 



XI 


3.2 ±0.28 


8.4 ±0.1 2 


X2 


1.5 ±0.09 


1.6±0.19 


X3 


1.7 ±0.21 


1.5±0.17 


X4 


5.3 ±0.1 6 


1.6 ±0.07 


X5 


1.6±0.16 


2.0 ±0.1 5 


X6 


4.3 ±0.22 


3.8 ±0.1 2 



Table 6 Effect of error constraint on 5-gene network 
identification, 5% noise 



Error 


Precision 


Recall 


Number of connections 


0.13 


0.78 


1.0 


9 


0.20 


0.88 


0.88 


7 


0.25 


0.78 


0.78 


7 


0.30 


0.88 


0.7 


5 


0.35 


0.88 


0.7 


5 



terms, respectively, and n is the total number of species 
in the system, in this case total number of genes in the 
network. In this work, we are using a modification of the 
above equation by assuming that species degradation fol- 
lows a first order kinetics of the corresponding species 
and independent of other species {hij = 1 for / = 0 other- 
wise). While being relevant to biological systems [18], 
this assumption also reduces the unknown parameters 
from 2n(n + 1) to n(n + 2) [16]. 

Network Identification Algorithm 

Our network identification algorithm is primarily based 
on the hypothesis of sparsity of network connections 
governing biological systems. Hence our overall objective 
is to determine the sparsest network which can satisfac- 
torily capture the observed network dynamics. Following 
this idea, the network identification problem is formu- 
lated as an optimization problem with the objective of 
promoting sparsity given the constraint of maximizing 
predictive capacity. Such problem definition results in a 
bi-level optimization problem, where the constraint itself 
is an unconstrained optimization problem. In the current 
formulation using S-system to model the gene expres- 
sion level (equation (1)), the kinetic orders (gij) are 
decomposed into two parts: binary part, Xij, which deter- 
mines the existence of the connection; and continuous 
part, pip representing the nature and strength of inter- 
action for an existing connection. A value of 1 of the bin- 
ary variable \^ would indicate the presence of the 
corresponding connection Xi ^ Xp while value of 0 indi- 
cates its absence. These binary variables are optimized in 
the upper level which results in an integer programming 
problem. For each chosen network in the upper level, 
the connections are sent to the lower level, where corre- 
sponding pij are optimized to maximize network predic- 
tion and hence minimize deviation of the network 
predictions from the observables. The lower level essen- 
tially optimizes both strength (magnitude) and nature 
(sign) of the existing connections (pij , reactions orders) 
as well as the strengths of the production and degrad- 
ation rate constants (a^, and /^i respectively). Hence it 
results in a continuous non-linear programming problem 
where the objective is to minimize the deviation of the 
predicted profiles from experimental data in a least 
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square sense. A constraint of tolerance (tol) is imposed 
on this minimized error which defines the maximum al- 
lowable deviation in prediction. The mathematical for- 
mulation of the network identification problem in its 
entirety is shown in equation (2): 

subjectto : argminj(A)<to/ 
where 



X-- 



nstep n 



(2) 



t=l i=l 

„ «.n;.4-ftn;.4 

gii = h ■ Pi! 
h =1^^'=' 

I 0, otherwise 

n 

l<^Xij < nx {m-3) 

i,j=i 



Xi J = binary variable 

x^-^^ , x^l^^^ =experimental and predicted gene expression 
levels, respectively 

ai^^ /Si = kinetic rates constants of ith gene's production 
and degradation, respectively 

gij,hij = kinetic orders of production and degradation, 
respectively 

nstep = number of time points 

n = number of genes constituting the network 

m = number of experimental time points 

In the above formulation represents the total 
number of network connections, minimizing which will 
promote sparsity in the network. The upper level integer 
programming is solved using combinatorial optimization 
techniques since combinatorial approach is known to 
handle Lq minimization problems more efficiently than 
approximation algorithms [30]. Of them, evolutionary 
algorithms are particularly efficient in finding a good ap- 
proximate solution for combinatorial problems [31]. In 
this work, we have used genetic algorithm (GA) for solv- 
ing the integer programming problem, while the lower 
level non-linear programming problem is solved using a 
standard least square optimization routine. 

GA is typically designed to handle unconstrained 
optimization problems. One technique for constraint 
handling in GA is by penalty function, where the constraint 
is conditionally incorporated in the objective function. For 
conditions violating the constraint the objective function is 



penalized, and not so otherwise. In the current formulation 
the constraint is incorporated in the objective function 
using the following modification of the objective function: 



mm 



+ penalty * 



max[C, 0] 



c 



(3) 



where C = - 1 



A significant advantage of the bi-level formulation is 
that it allows optimum utilization of experimental data 
by sequentially reducing the number of unknown para- 
meters in the lower level. In a conventional least-square 
parameter estimation problem, the connectivity is fixed 
and includes all possible network connections. There- 
fore, the size of the identifiable system is restricted, gov- 
erned by the availability of experimental data points so 
that number of unknown parameters is less than the 
number of data points. For instance, a single level algo- 
rithm, using the above S -System formulation, would be 
restricted to less than m-3 genes. However, in the 
current bi-level formulation, this restriction is relaxed. 
Because the number of network connections are first 
reduced in the upper level, the number of genes to be 
analyzed is not so restricted, with the only constraint 
coming from the connectivity: 



y^A^y < n X {m-3) 



(4) 



Hence the constraint is imposed on the maximum num- 
ber of binary variables assigned in the upper level, but does 
not constrain the total size of the analyzed network. More- 
over, our primary objective being sparsity of network con- 
nections, the formulation essentially tries to minimize the 
number of connections assigned to 1. Hence, except for 
the very initial phase of GA evolution, the constraint 
defined in equation (2) typically does not become active, 
and never so in the final optimal solution. 

Identification of Robust Networks 

Real world data typically contains noise due to experi- 
mental uncertainty and system stochasticity. Biological 
data are particularly notorious for its inherent heterogen- 
eity and stochasticity [32] . Hence it is important to expli- 
citly account for data variability in order to increase 
confidence in the predicted network. In the presence of 
large experimental repeats it may be possible to deter- 
mine robustness of identified network by repeatedly solv- 
ing the network identification problem at each of the 
experimental data sets and analyzing the connections 
which are heavily repeated. However, drawing statistically 
significant inference would necessitate a large data set 
which is impractical and infeasible. 
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An alternative to actual experimental repeats is to use 
bootstrapping. The purpose of this statistical technique 
is to estimate the distribution of the estimator around 
the unknown true value 6, However, instead of achieving 
this with a large number of individual replicates, boot- 
strapping utilizes resampling of the data. In this way, a 
large number artificial data sets can be generated from a 
limited number of experimental repeats. For each boot- 
strap run, data samples are randomly chosen, with re- 
placement, from the empirical distribution, with the size 
of each artificial set being the same as the experimental 
set (e.g. if the experimental set has 20 data points, so 
would the bootstrap set). For each bootstrap, the estima- 
tors (e.g. mean, variance, or, as in the case of the current 
work, regression parameters) are calculated, and with 
sufficient number of resampled data sets, relevant statis- 
tical information, including confidence intervals, can be 
estimated [26,33]. 

In our algorithm, we are dealing with limited experimen- 
tal data. Hence, following the above methodology, we gen- 
erate a large artificial data set by repeated resampling of 
the limited experimental repeats. Once the bootstrapped 
samples are obtained, the network identification algorithm 
previously described is applied to all bootstrap data sets to 
identify a network corresponding to each. The network 
sets thus obtained is further analyzed to determine the fre- 
quency of occurrence of each connection in the entire set 
of identified networks. We hypothesize that frequent oc- 
currence of network connections in the bootstrap samples 
indicate the insensitivity of the corresponding network to 
experimental noise, and hence claim that connection to be 
robust. 

In order to quantify the quality of prediction of the 
proposed algorithm the measures of recall and precision 
are used, calculated as: 



Additional file 



recall - 



TP 



precision 



TP + FN 
TP 



(5) 



TP + FP 



Where: TP (True Positive) denotes the number of con- 
nections correctly captured; FN (False Negative) denotes 
existing connections which are not captured in the iden- 
tified network; and FP (False Positive) denotes connec- 
tions which are incorrectly captured in the identified 
network. Following the above equation: a low value of re- 
call would indicate a more conservative estimate which 
is unable to capture many of the existing connections; a 
low value of precision will indicate prediction of incor- 
rect connections not appearing in the actual network; 
and a value of 1 will indicate perfect network identifica- 
tion. The flow diagram of the overall network identifica- 
tion algorithm is shown in Figure 1. 



Additional file 1: Additional Equations. The two systems of ordinary 
differer^tial equatior^s showr^ ir^ Additior^al file 1 were those used to 
create the in silico data for cases studies 1 ar^d 2, respectively. 
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